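The SIMT (Single Instruction, Multiple Threads) model is the core of GPU architecture. Although you write code for a single thread, the hardware organizes threads into a two-level hierarchy of grids and blocks. To maximize efficiency, the hardware further partitions each block into warps of 32 threads.
1. SIMT vs. SIMD
Unlike CPU SIMD (such as SSE/AVX), which requires you to pack data into vector registers by hand, SIMT lets threads behave as independent at the software level. The hardware automatically groups threads into warps and fetches a single instruction for all 32 threads at once, so they execute in lockstep.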
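To make the lockstep idea concrete, here is a toy host-side model in Python. It is a sketch under stated assumptions, not how any real GPU is implemented: the function name `run_warp_branch`, the per-lane active mask, and the example branch are all hypothetical and chosen only for illustration. It shows one instruction stream being applied to 32 lanes, with lanes masked off when they are not on the current path.

```python
WARP_SIZE = 32  # threads per warp, as stated in the text


def run_warp_branch(values):
    """Toy model of one warp running: if v is even, v // 2, else 3 * v + 1.

    Each path is issued once for the whole warp; an active mask decides
    which lanes actually apply it. Returns (results, passes), where
    `passes` counts how many paths the warp had to issue.
    """
    assert len(values) == WARP_SIZE
    results = list(values)
    mask_then = [v % 2 == 0 for v in values]
    mask_else = [not m for m in mask_then]
    paths = (
        (mask_then, lambda v: v // 2),
        (mask_else, lambda v: 3 * v + 1),
    )
    passes = 0
    for mask, op in paths:
        if not any(mask):
            continue  # no lane takes this path, so the warp skips it
        passes += 1
        for lane in range(WARP_SIZE):
            if mask[lane]:  # masked-off lanes sit idle during this pass
                results[lane] = op(values[lane])
    return results, passes


uniform_results, uniform_passes = run_warp_branch([2] * WARP_SIZE)
mixed_results, mixed_passes = run_warp_branch(list(range(WARP_SIZE)))
print(uniform_passes, mixed_passes)
```

When every lane takes the same path the model issues one pass; when lanes disagree it issues two, one per path, with part of the warp idle each time.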
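2. The Linearization Rule
Programmers work with threadIdx.x, threadIdx.y, and threadIdx.z in their logic, but the hardware converts them into a one-dimensional sequence for scheduling:
Index = x + (y × blockDim.x) + (z × blockDim.x × blockDim.y)
Because x is the fastest-varying index, threads with consecutive threadIdx.x values usually fall into the same warp, which matters for memory coalescing.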
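The formula can be checked with a few lines of Python run on the host. This is a minimal sketch: the helper names `linear_thread_id` and `warp_id` are hypothetical, and thread indices and block dimensions are passed as plain `(x, y, z)` tuples rather than CUDA built-ins.

```python
WARP_SIZE = 32  # threads per warp, as stated in the text


def linear_thread_id(thread_idx, block_dim):
    """Apply Index = x + y * blockDim.x + z * blockDim.x * blockDim.y."""
    x, y, z = thread_idx
    dim_x, dim_y, _dim_z = block_dim
    return x + y * dim_x + z * dim_x * dim_y


def warp_id(thread_idx, block_dim):
    """Warp number within the block: linear ID divided by the warp size."""
    return linear_thread_id(thread_idx, block_dim) // WARP_SIZE


block = (16, 16, 1)
print(linear_thread_id((2, 1, 0), block))  # 2 + 1 * 16 + 0 = 18
print(warp_id((15, 1, 0), block))          # linear ID 31, still warp 0
print(warp_id((0, 2, 0), block))           # linear ID 32, first thread of warp 1
```

With blockDim.x = 16, two full rows of x fill one warp, so threads that differ only in x stay together.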
QUESTION 1
What is the physical scheduling 'atom' (minimum unit) in NVIDIA's SIMT model?
A single Thread
A Warp (32 threads)
A Streaming Multiprocessor
A Grid
✅ Correct!
The hardware fetches one instruction for a warp of 32 threads. It is the minimum scheduling unit.
❌ Incorrect
While you write code for single threads, the hardware dispatches them in groups of 32 (Warps).
QUESTION 2
In an 8x8 thread block, which threads will be assigned to Warp 0?
Threads with linear IDs 0 through 31
Threads with y=0 and y=1
Only threads where x=0
All 64 threads belong to Warp 0
✅ Correct!
Hardware linearizes the block (row-major based on x) and takes the first 32 threads for Warp 0.
❌ Incorrect
Warp partitioning is based on the linearized 1D index, regardless of dimensions.
QUESTION 3
How does SIMT differ from traditional SIMD (like SSE/AVX)?
SIMT requires manual packing of vector registers.
SIMT allows threads to operate independently at the software level while hardware manages vectorization.
SIMD is only for GPUs; SIMT is for CPUs.
There is no difference; they are synonymous.
✅ Correct!
SIMT abstracts the hardware vectorization, allowing for more flexible control flow and easier programming.
❌ Incorrect
Manual packing is a hallmark of SIMD. SIMT handles grouping via hardware warps.
QUESTION 4
Using the linearization formula, what is the ID of T(2, 1, 0) in a block with blockDim.x=16 and blockDim.y=16?
3
18
32
17
✅ Correct!
Index = 2 + (1 * 16) + (0 * 16 * 16) = 18.
❌ Incorrect
Follow the formula: x + (y * blockDim.x) + (z * blockDim.x * blockDim.y).
QUESTION 5
What happens if threads within a warp take different execution paths (e.g., an if-else statement)?
The warp executes all paths in parallel without penalty.
The warp splits into two warps.
The hardware serializes the paths, disabling threads not on the current path.
The kernel crashes.
✅ Correct!
This is control flow divergence; it reduces efficiency because the paths are executed sequentially.
❌ Incorrect
Hardware cannot split a warp; it must execute divergent paths one after another.
Architectural Analysis: Linearization and Bandwidth
Applying Warp Partitioning to Matrix Addition
You are optimizing a matrix addition kernel on a 2D grid. The threads are organized into 8x8 blocks. You are considering using shared memory to improve performance.
Q
1. [Reading Context: Figure 6.1 shows an example of placing threads of a two-dimensional (2D) block into linear order.] Draw out/Represent the partitioning of an 8x8 thread block into warps. Which thread index (x,y) marks the end of Warp 0?
Solution:
In an 8x8 block (64 threads), threads are linearized using $Index = tx + (ty \times 8)$. Warp 0 covers Linear IDs 0 to 31. T(0,0) is ID 0. ID 31 is calculated as $tx=7, ty=3$ ($7 + 3 \times 8 = 31$). Therefore, Warp 0 starts at $T(0,0)$ and ends with $T(7,3)$. Warp 1 begins at $T(0,4)$ (ID 32) and ends at $T(7,7)$ (ID 63).
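The partitioning in this solution can be reproduced by enumerating the block on the host. The sketch below is only an illustration: `partition_block` is a hypothetical helper, and it assumes a 2D block (z = 0) and the 32-thread warp size used throughout the lesson.

```python
WARP_SIZE = 32  # threads per warp, as stated in the text


def partition_block(dim_x, dim_y):
    """Group the (x, y) threads of a 2D block by warp, in linear-ID order."""
    warps = {}
    for y in range(dim_y):
        for x in range(dim_x):  # x varies fastest, matching the formula
            linear = x + y * dim_x
            warps.setdefault(linear // WARP_SIZE, []).append((x, y))
    return warps


warps = partition_block(8, 8)
for w, threads in warps.items():
    print(f"Warp {w}: T{threads[0]} .. T{threads[-1]} ({len(threads)} threads)")
# Warp 0: T(0, 0) .. T(7, 3) (32 threads)
# Warp 1: T(0, 4) .. T(7, 7) (32 threads)
```

Each warp spans four full rows of the 8x8 block, which is why Warp 0 ends at T(7,3).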
Q
2. Consider the matrix addition where each element of the output matrix is the sum of the corresponding elements of the two input matrices. Can one use shared memory to reduce the global memory bandwidth consumption? Explain with details (approx. 150 words).
Solution:
No, shared memory cannot reduce global memory bandwidth for matrix addition. In matrix addition, each output element $C[i][j]$ is the sum of $A[i][j] + B[i][j]$. Each element from the input matrices $A$ and $B$ is accessed exactly once by exactly one thread to compute its unique output. Shared memory is a high-speed scratchpad designed for data reuse—where multiple threads within a block read the same global memory address repeatedly (as seen in matrix multiplication). Since there is zero data commonality between threads in matrix addition, loading data into shared memory first would actually increase overhead. It would require one global load to shared memory, a `__syncthreads()` barrier, and then a shared memory load before the addition. This adds instruction count and synchronization latency without decreasing the number of global memory transactions. To optimize matrix addition, one should focus on global memory coalescing rather than shared memory utilization.
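The reuse argument can be illustrated by counting how often each input element is read. The following is a host-side sketch with hypothetical helper names (`loads_matrix_add`, `loads_matrix_mul`); it counts logical reads per element for naive n×n kernels and does not model caches, warps, or real memory transactions.

```python
from collections import Counter


def loads_matrix_add(n):
    """Count reads per input element for C[i][j] = A[i][j] + B[i][j]."""
    loads = Counter()
    for i in range(n):
        for j in range(n):  # one logical thread per output element
            loads[("A", i, j)] += 1
            loads[("B", i, j)] += 1
    return loads


def loads_matrix_mul(n):
    """Count reads per input element for a naive C = A x B."""
    loads = Counter()
    for i in range(n):
        for j in range(n):  # one logical thread per output element
            for k in range(n):
                loads[("A", i, k)] += 1
                loads[("B", k, j)] += 1
    return loads


n = 4
print(set(loads_matrix_add(n).values()))  # {1}: every element is read once
print(set(loads_matrix_mul(n).values()))  # {4}: every element is read n times
```

In the addition case there is nothing for shared memory to reuse, while in the multiplication case each element is read n times, which is the reuse that tiling through shared memory exploits.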